02. Containment
One of your first tasks will be to create containment features. These features first look at a whole body of text, counting the occurrences of words across several text files, and then compare a submitted text with a source text, relative to the traits of that whole body of text.
Calculating containment
You can calculate n-gram counts using count vectorization, and then follow the formula for containment:
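One common way to write containment, assuming A is the submitted (answer) text and S is the source text, is the number of n-grams the two texts share, normalized by the number of n-grams in A. The symbols A, S, and g (a single n-gram) are labels chosen for this sketch, and the intersection is assumed to take the smaller of the two counts for each n-gram:

$$
\text{containment}(A, S) = \frac{\sum_{g} \min\big(\text{count}_A(g),\ \text{count}_S(g)\big)}{\sum_{g} \text{count}_A(g)}
$$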
If the two texts have no n-grams in common, the containment will be 0; if all of their n-grams intersect, the containment will be 1. Intuitively, you can see how having longer n-grams in common might be an indication of cut-and-paste plagiarism.
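To make the 0-to-1 range concrete, here is a minimal sketch of the calculation. It assumes containment is the count of shared n-grams divided by the total n-gram count of the submitted text, and it uses `collections.Counter` from the standard library in place of a count vectorizer so that the example has no dependencies. The function names `ngram_counts` and `containment`, and the lowercase, whitespace-based tokenization, are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def containment(answer_text, source_text, n):
    """Fraction of the answer's n-grams that also appear in the source.

    Assumption: shared n-grams are counted with the smaller of the two
    counts, and the result is normalized by the answer's n-gram total.
    """
    answer = ngram_counts(answer_text, n)
    source = ngram_counts(source_text, n)
    total = sum(answer.values())
    if total == 0:
        # The answer is shorter than n words, so it has no n-grams.
        return 0.0
    # Counter & Counter keeps the minimum count for each shared n-gram.
    shared = sum((answer & source).values())
    return shared / total


answer = "this is an answer text"
source = "this is a source text"
print(containment(answer, source, n=1))
print(containment(answer, source, n=2))
```

With these two sentences, three of the five unigrams in the answer also occur in the source, but only one of its four bigrams does, so the value drops as n grows. A high containment at larger n therefore points to longer copied passages.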